improve observability over exceptions#1730
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 99d816eb7f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (leaf_exists) | ||
| ops.emplace_back(zkutil::makeSetRequest(last_exception_path, entry.toJsonString(), -1)); | ||
| else | ||
| { | ||
| LOG_INFO(log, "ExportPartition: Exceptions per replica path does not exist, will create it"); | ||
| ops.emplace_back(zkutil::makeCreateRequest(exceptions_per_replica_path, "", zkutil::CreateMode::Persistent)); | ||
| ops.emplace_back(zkutil::makeCreateRequest(count_path, "1", zkutil::CreateMode::Persistent)); | ||
| ops.emplace_back(zkutil::makeCreateRequest(last_exception_path, "", zkutil::CreateMode::Persistent)); | ||
| ops.emplace_back(zkutil::makeCreateRequest(last_exception_path / "part", part_name, zkutil::CreateMode::Persistent)); | ||
| ops.emplace_back(zkutil::makeCreateRequest(last_exception_path / "exception", exception_message, zkutil::CreateMode::Persistent)); | ||
| } | ||
| ops.emplace_back(zkutil::makeCreateRequest(last_exception_path, entry.toJsonString(), zkutil::CreateMode::Persistent)); |
There was a problem hiding this comment.
Create missing last_exception container for legacy exports
For exports that were created before this change, ZooKeeper contains .../exceptions_per_replica but not .../last_exception. In that upgrade scenario, this branch issues Create(.../last_exception/<replica>) without creating the parent path first, so Keeper returns ZNONODE and the surrounding tryMulti fails atomically. Because appendExceptionOps is used inside failure-handling multis (part failure bookkeeping, commit failure retries, and timeout-to-KILLED), those state transitions can be skipped repeatedly for in-flight legacy tasks, leaving them stuck in PENDING during failures.
Useful? React with 👍 / 👎.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Observability over exceptions on export partition operations was broken and not reliable. This PR simplifies it by making it a single JSON array instead of multiple paths in zookeeper. Another change is that now
system.replicated_partition_exportsreturns only local information, it does not reach zookeeper. #1716Documentation entry for user-facing changes
...
CI/CD Options
Exclude tests:
Regression jobs to run: